A Additional HQA Results

Table 5: Additional CelebA interpolations of the HQA encoder output z

Neural Information Processing Systems

Compression is from 98,304 to 576 bits (171x compression). Compression is from 98,304 to 144 bits (683x compression). The far left and right images are originals.

B.1 Motivation

In this section we outline the probabilistic model that motivates the HQA loss: L = E_{q(z|x)}[log p(x | z = k)] + H[q(z|x)]. A desired property of the HQA, motivated in Section 4.4, is the non-deterministic posterior. We contrast these two models in Figure 8. This model is a Variational Autoencoder with a simple Mixture of Gaussians prior.
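The loss combines the decoder's expected log-likelihood under the discrete posterior with the posterior's entropy. A minimal numeric sketch of evaluating it for one input (the function name and toy numbers are ours, not from the paper):

```python
import numpy as np

def hqa_loss(log_px_given_z, q):
    """Evaluate L = E_{q(z|x)}[log p(x | z=k)] + H[q(z|x)] for a single
    input x with a discrete posterior over K codebook entries.

    log_px_given_z : (K,) decoder log-likelihoods log p(x | z = k)
    q              : (K,) posterior probabilities q(z = k | x), summing to 1
    """
    expected_loglik = np.dot(q, log_px_given_z)   # E_q[log p(x | z)]
    entropy = -np.sum(q * np.log(q + 1e-12))      # H[q(z | x)]
    return expected_loglik + entropy

# Toy example: 4 codebook entries, a peaked posterior.
log_px = np.array([-1.0, -2.0, -3.0, -4.0])
q = np.array([0.7, 0.1, 0.1, 0.1])
print(hqa_loss(log_px, q))
```

The entropy term rewards keeping the posterior non-deterministic, which is the desired property discussed above.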


Two Heads Are Better than One: Simulating Large Transformers with Small Ones

Yu, Hantao, Alman, Josh

arXiv.org Artificial Intelligence

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M \ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including average-case inputs, sliding window masking and attention sinks, the optimal number $O(N/M)$ of small transformers suffices.
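The counting can be illustrated with a tiling sketch: a length-N attention output is assembled from length-M chunks, with one small score computation per (query block, key block) pair, i.e. O((N/M)^2) small calls. This mirrors the paper's setting only in spirit; it is a standard log-sum-exp tiling, not the authors' construction:

```python
import numpy as np

def attention(Q, K, V):
    """Dense softmax attention, for reference."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, M):
    """Reproduce length-N attention using O((N/M)^2) computations that each
    only touch length-M chunks, merging partial results with a running
    log-sum-exp (max, normalizer, accumulator)."""
    N, d = Q.shape
    out = np.zeros_like(Q)
    for i in range(0, N, M):
        Qi = Q[i:i + M]
        m = np.full((Qi.shape[0], 1), -np.inf)   # running row max
        l = np.zeros((Qi.shape[0], 1))           # running normalizer
        acc = np.zeros((Qi.shape[0], d))         # running weighted sum
        for j in range(0, N, M):
            S = Qi @ K[j:j + M].T / np.sqrt(d)   # one "small" block of scores
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)
            scale = np.exp(m - m_new)            # rescale old partial sums
            l = l * scale + P.sum(axis=-1, keepdims=True)
            acc = acc * scale + P @ V[j:j + M]
            m = m_new
        out[i:i + M] = acc / l
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
err = np.abs(blockwise_attention(Q, K, V, M=2) - attention(Q, K, V)).max()
print(f"max deviation from dense attention: {err:.2e}")
```

Note this sketch tiles a single attention computation; the paper's stronger claim is that whole small *transformers* suffice, including the O(N/M) result under masking structures like sliding windows.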


xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

Beck, Maximilian, Pöppel, Korbinian, Lippe, Phillip, Kurle, Richard, Blies, Patrick M., Klambauer, Günter, Böck, Sebastian, Hochreiter, Sepp

arXiv.org Artificial Intelligence

Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
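The constant-memory property comes from replacing a growing key-value cache with a fixed-size recurrent state. The following is a generic matrix-memory recurrence of the kind mLSTM-style cells build on; it is a hedged sketch of the idea (gating and stabilization omitted), not the xLSTM 7B implementation:

```python
import numpy as np

def recurrent_readout(q_seq, k_seq, v_seq):
    """Linear-attention-style recurrence with a d x d matrix memory.
    Per step: S <- S + k v^T and z <- z + k, then y = S^T q / (z . q).
    The state (S, z) has fixed size regardless of sequence length, so
    per-token cost and memory are constant -- the property the abstract
    highlights for inference."""
    T, d = q_seq.shape
    S = np.zeros((d, d))   # matrix memory
    z = np.zeros(d)        # normalizer state
    ys = []
    for t in range(T):
        S += np.outer(k_seq[t], v_seq[t])
        z += k_seq[t]
        denom = max(abs(z @ q_seq[t]), 1e-6)
        ys.append((S.T @ q_seq[t]) / denom)
    return np.stack(ys)
```

With non-negative keys and queries this reproduces cumulative (unnormalized-kernel) linear attention exactly, which is why such recurrences scale linearly with sequence length.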


Metals can be squeezed into sheets just a few atoms thick

New Scientist

Sheets of metal just two atoms thick can be produced by squashing molten droplets at great pressure between two sapphires. The researchers who developed the process say the unusual materials could have applications in industrial chemistry, optics and computers. Last year, scientists created a gold sheet that was a single atom thick, which they dubbed "goldene" after graphene, a material made of a single layer of carbon atoms. Such materials have been described as two-dimensional, as they are as thin as chemically possible. But making other 2D metals hadn't been possible until now. The new technique, developed by Luojun Du at the Chinese Academy of Sciences and his colleagues, can create 2D sheets of bismuth, gallium, indium, tin and lead that are as thin as their atomic bonds allow.


Route Sparse Autoencoder to Interpret Large Language Models

Shi, Wei, Li, Sihang, Liang, Tao, Wan, Mingyang, Ma, Gojun, Wang, Xiang, He, Xiangnan

arXiv.org Artificial Intelligence

Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at https://github.com/swei2001/RouteSAEs.
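The routing mechanism can be sketched as a learned convex combination over per-layer activations feeding one shared TopK encoder. All names, shapes, and the toy linear router below are our assumptions for illustration, not the authors' released code:

```python
import numpy as np

def route_and_encode(layer_acts, W_route, W_enc, b_enc, k=64):
    """Sketch of a RouteSAE-style forward pass for one token.

    layer_acts   : (L, d) residual-stream activations, one row per layer
    W_route      : (d,)   toy linear router scoring each layer's activation
    W_enc, b_enc : (d, F), (F,) shared SAE encoder across all layers
    k            : sparsity constraint (number of active features)
    """
    scores = layer_acts @ W_route              # (L,) one score per layer
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax routing weights
    x = w @ layer_acts                         # (d,) routed mixture of layers
    pre = x @ W_enc + b_enc                    # (F,) feature pre-activations
    f = np.maximum(pre, 0.0)
    if k < f.size:                             # TopK: keep k largest features
        thresh = np.partition(f, -k)[-k]
        f = np.where(f >= thresh, f, 0.0)
    return w, f
```

The router adds only a d-dimensional parameter vector here, illustrating why the overhead over a single-layer SAE can be minimal while still letting features attach to whichever layer activates them.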


Residual Stream Analysis with Multi-Layer SAEs

Lawson, Tim, Farnik, Lucy, Houghton, Conor, Aitchison, Laurence

arXiv.org Artificial Intelligence

Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, standard SAEs are trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer simultaneously. The residual stream is usually understood as preserving information across layers, so we expected to, and did, find individual SAE features that are active at multiple layers. Interestingly, while a single SAE feature is active at different layers for different prompts, for a single prompt, we find that a single feature is far more likely to be active at a single layer. For larger underlying models, we find that the cosine similarities between adjacent layers in the residual stream are higher, so we expect more features to be active at multiple layers. These results show that MLSAEs are a promising method to study information flow in transformers.
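The training setup can be pictured by pooling activation vectors from every (layer, token) pair into one sample set for a single shared encoder; reshaping features back by layer then shows where a given feature fires. This is an illustrative data layout with random weights only, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual stream: L layers x T tokens x d dimensions. An MLSAE is one
# SAE trained on all of these vectors pooled together, so any feature is
# free to be active at any layer.
L, T, d, F = 4, 8, 16, 64
resid = rng.normal(size=(L, T, d))
samples = resid.reshape(L * T, d)         # (layer, token) pairs as samples

W_enc = rng.normal(size=(d, F)) / np.sqrt(d)
feats = np.maximum(samples @ W_enc, 0.0)  # one encoder shared across layers

# At which layers does feature 0 fire? Reshape back to (L, T) to inspect.
active = (feats[:, 0] > 0).reshape(L, T)
print(active.any(axis=1))                 # per-layer activity of one feature
```

In the trained setting, the paper's observation is that a feature's active layer varies across prompts but tends to be a single layer within one prompt; this per-layer view is exactly what the reshape exposes.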


What does self-attention learn from Masked Language Modelling?

Rende, Riccardo, Gerace, Federica, Laio, Alessandro, Goldt, Sebastian

arXiv.org Machine Learning

Transformers are neural networks which revolutionised natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalisation error of self-attention in a model scenario analytically using the replica method.
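The object a single self-attention layer is shown to learn is the site-wise conditional of a generalised Potts model, and the MLM training objective then coincides with the pseudo-likelihood. A direct sketch of both quantities (array shapes and function names are our conventions):

```python
import numpy as np

def potts_conditional(J, seq, i):
    """Conditional p(s_i = a | rest) of a generalised Potts model with
    pairwise couplings, the distribution a masked-word predictor targets.

    J   : (N, N, q, q) couplings J_ij(a, b) between sites i, j
    seq : (N,) colour indices of the current configuration
    i   : masked site
    """
    N = seq.shape[0]
    qcol = J.shape[2]
    logits = np.zeros(qcol)
    for j in range(N):
        if j != i:
            logits += J[i, j, :, seq[j]]     # field on site i from site j
    p = np.exp(logits - logits.max())
    return p / p.sum()

def pseudo_loglik(J, seq):
    """Pseudo-likelihood: sum over sites of the log conditional. Maximizing
    this in J is the classical estimator for the inverse Potts problem."""
    return sum(np.log(potts_conditional(J, seq, i)[seq[i]])
               for i in range(len(seq)))
```

Training a masked-language model on data from this distribution amounts to maximizing `pseudo_loglik` over the couplings, which is the equivalence the abstract states.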


A Unified View Between Tensor Hypergraph Neural Networks And Signal Denoising

Wang, Fuli, Pena-Pena, Karelia, Qian, Wei, Arce, Gonzalo R.

arXiv.org Artificial Intelligence

Hypergraph Neural Networks (HyperGNNs) and hypergraph signal denoising (HyperGSD) are two fundamental topics in higher-order network modeling. Understanding the connection between these two domains is particularly useful for designing novel HyperGNNs from a HyperGSD perspective, and vice versa. In particular, the tensor-hypergraph convolutional network (T-HGCN) has emerged as a powerful architecture for preserving higher-order interactions on hypergraphs, and this work shows an equivalence relation between a HyperGSD problem and the T-HGCN. Inspired by this intriguing result, we further design a tensor-hypergraph iterative network (T-HGIN) based on the HyperGSD problem, which takes advantage of a multi-step updating scheme in every single layer. Numerical experiments are conducted to show the promising applications of the proposed T-HGIN approach.
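The multi-step scheme can be sketched as unrolling a few iterations of a signal-denoising fixed-point update inside a single layer. Here an ordinary row-normalized adjacency stands in for the hypergraph tensor operator, so this is a shape-of-the-idea sketch rather than the T-HGIN architecture itself:

```python
import numpy as np

def multi_step_layer(A, x0, alpha=0.5, steps=3):
    """One layer that unrolls `steps` iterations of the denoising update
        x^{k+1} = (1 - alpha) * x^0 + alpha * A @ x^k,
    which trades off fidelity to the input signal x^0 against smoothness
    under the propagation operator A.

    A  : (n, n) row-normalized propagation operator (hypergraph tensor
         operator in T-HGIN; plain adjacency here)
    x0 : (n, c) input node signals
    """
    x = x0
    for _ in range(steps):
        x = (1 - alpha) * x0 + alpha * (A @ x)
    return x
```

A standard single-step HyperGNN layer corresponds to `steps=1`; unrolling more steps per layer lets one layer move further toward the denoising fixed point without stacking (and training) additional layers.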